RNGSSELIB: Program library for random number generation, SSE2 

realization 



L.Yu. Barash, L.N. Shchur 

Landau Institute for Theoretical Physics, 14-2432 Chernogolovka, Russia 



Abstract 

The library RNGSSELIB for random number generators (RNGs) based upon the SSE2 command set is 
presented. The Ubrary contains reahzation of a number of modern and most reUable generators. Usage 
of SSE2 command set ahows to substantiahy improve performance of the generators. Three new RNG 
reahzations are also constructed. We present detailed analysis of the speed depending on compiler usage 
and associated optimization level, as well as results of extensive statistical testing for all generators using 
available test packages. Fast SSE implementations produce exactly the same output sequence as the original 
algorithms. 
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Keywords: Statistical methods, Monte Garlo, Random 
numbers. Pseudorandom numbers. Random number 
generation. Streaming SIMD Extensions 
Classification: 4.13 Statistical Methods 
External routines/libraries: 
Subprograms used: 

Nature of problem: Any calculation requiring uniform 
pseudorandom number generator, in particular, Monte 
Garlo calculations. 

Solution method: The library contains realization 
of a number of modern and reliable generators: 
mtl9937, mrg32k3a and If srllS. Also new realizations 
for the method based on parallel evolution of an 
ensemble of dynamical systems are constructed: GM19, 
GM31 and GM61. The library contains both usual 
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realizations and realizations based on SSE command 
set. Usage of SSE commands allows to substantially 
improve performance of all generators. 

Restrictions: For SSE realizations of the genera- 
tors, Intel or AMD GPU supporting SSE2 command 
set is required. In order to use the realization 
If srllSsse, GPU must support SSE4 command set. 

Unusual features: 

Additional comments: 

Running time: Running time is of the order of 
20 sec for generating 10' pseudorandom numbers with 
a PG based on Intel Gore 17-940 GPU. Running time 
is analysed in detail in Sec. 5 of the paper. 



1. Introduction 

Pseudo random numbers, generated recursively 
by deterministic rules, represent one of important 
ingredients of numerical simulations widely used in 
physics and material science [1]. Design of the ran- 
dom number generators (RNG) producing pseudo 
random numbers which approximate "true random- 
ness" [2] is a great challenge for the computational 
physics and computer science in general. 
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High speed and statistical robustness are the 
most important characteristics needed for good 
RNG. Also, long period length of the generated se- 
quences is among the most desired features. 

In the paper we present a library of modern and 
most reliable RNGs known today. Namely, the 
realizations of the modern generators MT19937, 
MRG32k3a, LFSR113, GM19SSE, GM31SSE and 
GM61SSE are presented. MT19937 is the 2002 
version of the Mersenne Twister generator of Mat- 
sumoto and Nishimura 0], which is based on 
the recent generalizations to the GFSR method. 
MRG32k3a is the combined multiple recursive 
generator proposed in Q, and LFSR113 is the 
combined Tausworthe generator of L'Ecuyer Q. 
Each of the generators GM19SSE, GM31SSE and 
GM61SSE is based on an ensemble of sequences 
generated by multiple recursive method. The 
method is closely related to the method of random 
number generation based on evolution of the ensem- 
ble of the chaotic dynamical systems ^] . We intro- 
duce three RNG realizations based on the method: 
GM19SSE, GM31SSE and GM61SSE. 

An important feature of our library is the ability 
to speed up the RNG generation using Streaming 
SIMD Extensions 2 (SSE2) technology, introduced 
in Intel Pentium 4 processors in 2001 0. AMD 
added SSE2 support to their processors with the in- 
troduction of their Opteron and Athlon64 ranges of 
64-bit CPUs in 2003 All of Intel and AMD pro- 
cessors, fabricated later than 2003, support SSE2 
instruction set. SSE2 technology allows to use 
128-bit XMM-registers effectively and to accelerate 
computations. A similar technique was previously 
used for implementing some RNGs 0, @]- In the 
present work we demonstrate that this approach 
can essentially speed up most RNGs. 

Program codes for the generators and proper ini- 
tializations can be found in JLQ]. We would like to 
stress that our SSE implementations of algorithms 
produce exactly the same output sequence as the 
original algorithms. We would appreciate any com- 
ments on user experiences. 

2. Classical and modern random number 
generators 

The most widely used RNGs can be divided into 
two classes. The first class is represented by linear 
congruential generator (LCG), and the second by 
shift register (SR), or Tausworthe generator. 



Linear Congruential Generators (LCGs) are the 
best-known and (still) most widely available RNGs 
in use today. An example of the realization of 
an LCG generator is the UNIX rand generator 
y„ = (1103515245 y„_i + 12345) (mod 23^). The 
practical recommendation is that LCGs should be 
avoided for applications dealing with the geomet- 
ric behavior of random vectors in high dimensions 
because of the bad geometric structure of the vec- 
tors that they produce 0, [Hi- Another problem 
is that classical LCGs have relatively small period 
lengths. Also, the lower-order bits of the gener- 
ated sequence have a far shorter period than the 
sequence as a whole if the modulo is set to a power 
of 2. An immediate generalization of LCG is Mul- 
tiple Recursive Generator (MRG). 

Generalized Feedback Shift Register (GFSR) se- 
quences are widely used in many areas of compu- 
tational and simulational physics [l^, [l3|. These 
RNGs are quite fast and possess huge periods given 
a proper choice of the underlying primitive trino- 
mial. This makes them particularly well suited for 
applications that require many pseudorandom num- 
bers. But several fiaws have been observed in the 
statistical properties of these generators, which can 
result in systematic errors in Monte Carlo simu- 
lations. Typical examples include the Wolff sin- 
gle cluster algorithm simulations for the 2D Ising 
model [3] and 3D Isingmodel [l^, random and 
self-avoiding walks [lil Il7j. and the 3D Blume- 
Capel model using local Metropolis updating [l8| . 

Modern RNGs are modifications and generaliza- 
tions to the LCG and GFSR methods, they have 
much better periodic and statistical properties [l9l | . 
In this work we discuss properties, efficient realiza- 
tions and statistical tests for some of them. 

2.1. Mersenne Twister random number generator 
MT 19937 

Algorithm MT19937 is a Mersenne Twister (MT) 
generator by Matsumoto and Nishimura Q. In 
a sense, MT algorithm represents a modified and 
twisted GFSR generator. MT generates the vec- 
tors of word size by the recurrence 

Xfc+„ Xk+m ® (x^|xi+i)A. (1) 

For MT19937 values of parameters are chosen as 
follows: n = 624, m = 397, {xl\x.[^^) is the 32- 
bit integer obtained by concatenating one upper 
bit of Xfc and 31 lower bits of (x^^-^), and © is 
exclusive or operation. A is a companion matrix: 
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xA = shiftright(x) if the least significant bit of x 
is 0, otherwise xA = shiftright(x) © a, where vec- 
tor a consists of 32 bits and represents 32-bit inte- 
ger constant 2567483615. MT19937 is a generator 
with a very long period 2i^^37 _ and it provides 
623-dimensional equidistribution of pseudorandom 
numbers up to 32-bit accuracy. As for most gener- 
ators, Mersenne Twister is sensitive to poor initial- 
ization, so proper initialization is very important. 

2.2. Combined Multiple Recursive Generator 
MRG32k3a 

Algorithm MRG32k3a represents a Combined 
Multiple Recursive Generator (CMRG) found in [3| 
by means of extensive search of parameters for 
such generator. MRG32k3a algorithm combines 
two MRGs (see right column in Table [3]). The state 
of the first MRG is described by variables xO, xl, 
x2, whereas the variables yO, yl, y2 represent the 
state of the second MRG. At each step, the MRG 
states are shifted as shown in the right column of 
Table [31 where the calculated values of pi and p2 
are the new outputs of the MRGs. 

2.3. Combined Tausworthe generator LFSR113 
Another modern generator is a combined Taus- 
worthe generator, i.e., a combined GFSR sequence. 
In Q an extensive search for good parameters for 
such generator is carried out, and, as a result, an al- 
gorithm LFSR113 for such generator is presented. 
Right column in Table U describes the algorithm. 
The state of the generator is described by variables 
zl,z2,z3,z4. 

2.4. Generators GM19, GM31, and GM61 based on 
the ensemble of MRGs 

Finally, we include in the library improved ver- 
sions of the generators GM19 and GM31 worked 
out in 0, and also the new realization GM61. It 
is suggested in Q to construct RNGs based on an 
ensemble of hyperbolic automorphisms of the unit 
two-dimensional torus. 
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where i = 0, 1, . . . , (s — 1), the elements of two- 
dimensional matrix M are integers, det M = 1, and 
|TrAf| > 2. The output of the generator is defined 
as 



(3) 



Dynamical system ^ is widely known under name 
cat map (there are two reasons for this terminology: 
first, CAT is an acronym for Continuous Automor- 
phism of the Torus; second, the chaotic behavior 
of these maps is traditionally described by showing 
the result of their action on the face of the cat (20jV 
The main properties of the generator based on cat 
maps can be predicted theoretically because there 
is close relation between the periodic properties of 
cat maps and arithmetical properties of algebraic 
numbers Q. The basic recurrence represent- 
ing an evolution of a simple nonlinear dynamical 
system on a discrete g x g lattice, is equivalent to 
linear recurrence modulo g with the same charac- 



teristic polynomial f{x) 
the matrix M: 



kx + q as that of 
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(mod g), (4) 
(mod g), (5) 
Mini. There- 



where k = TiM, g = det M = 1 ^ 

fore, in a sense, the method represents a modified 
and generalized MRG, where only part of the infor- 
mation of a generator state influences the generator 
output. 



Table 1: 
GM61. 



Parameters of the generators GM19, GM31 and 



Generator 


k 


q 


9 


Period 


GM19 


15 


28 


2^y - 1 


2.7- IQii 


GM31 


11 


14 


231-1 


4.6- IQi** 


GM61 


24 


74 


261-1 


5.3-1036 



The modified RNG based on the same idea, 
where lattice parameter g is chosen to be Mersenne 
prime, and the characteristic polynomial f{x) = 
x^ — kx + q is chosen to be primitive over Zg, pos- 
sess very good statistical and periodic properties. 
In this case the basic recurrence is closely related to 
so-called matrix generator of pseudorandom num- 
bers studied in |2 
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i=0 



23| . Primitivity of the char- 
acteristic polynomial guarantees maximal possible 
period g^ — 1 oi the output sequence, also it is re- 
quired for the primitivity of f{x) that q ^ 1. Ta- 
ble [T] shows parameter values chosen for generators 
GM19, GM31 and GM61. For generating random 
numbers it is convenient to employ the linear recur- 
rence (|4|). As above, the output function is defined 
with i.e. each bit of the output corresponds 
to its own recurrence, and s = 32 recurrences are 
calculated in parallel. 

There is an easy algorithm to calculate in 
(|4]) very quickly from x^^^ and a;'i-' for any large 



3 



n. Indeed, if x'^^"-' = knx'^'^^ — qnx'^^^ (mod g), 
then a;(4") = (^2 _ 2g„)a;(2") - g2^(o) (mod g). As 

was mentioned already in Q , this helps to initialize 
the generator. To initialize all 32 recurrences, the 
following initial conditions are used: xf^ — x^^^\ 



Table 2: SSE2 realization for the main part of algorithm 
MT19937SSE (tempering is omitted in the table). 



.(1) 



X 



0,1,..., 31. Here A is a value of 



the order of (p^ — 1)/32, which is not very close to a 
divisor of p'^ — 1 or to a large power of 2. 

3. SSE-realizations of RNGs 

Modern Intel and AMD processors support SSE2 
instruction set. SSE2 is acronym for Streaming 
SIMD (Single Instruction Multiple Data) Exten- 
sions 2. SSE2 realization of instruction set allows 
user to perform the same operation in parallel on 
four 32-bit numbers placed in 128-bit XMM reg- 
isters, or on two 64-bit numbers [7]. It is known 
that proper programming of 32-bit RNG using SSE 
command set may essentially decrease computation 
time needed for generating random number. For ex- 
ample, in paper [9] SSE realization of the combined 
RNG based on two Shift Register generators with 
shift register lengths 9689 and 4423 accelerated ran- 
dom number generation by factor 3.5. 

In this section we describe realization of ran- 
dom number generators discussed in the previous 
section using SSE2 command set. These effective 
realizations are equivalent to classical algorithms 
discussed in previous section, i.e. all output val- 
ues are the same. For example, MT19937SSE al- 
gorithm produces the same output values as well- 
known MT19937 introduced in fs*], in contrast to 
other Mersenne Twister SSE algorithms introduced 
in 
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Program codes for all realizations and examples 
of practical implementations, including proper ini- 
tialization, can be found in [lO|. 

3.1. SSE2 based MT19937 

In Table[2]we present main ideas of our version for 
MT19937 algorithm (see SecH]), based on SSE2 
command set. The whole array of 624 doubleword 
integers is divided into fours of integers, and four 
recurrent relations ([1} are calculated in parallel. 

3.2. SSE2 based MRG32k3a 

Right column in Table [3] shows usual MRG32k3a 
algorithm (see Sec. 12. 2p . The recurrent relations 
are: Xi+3 = axi + bxi+i (mod mi) and yi+3 = 
cyi + (mod TO2), where a — —810728, 6 = 



asmC movaps 


2500(/o0) , /o/oxmm0\n" \ 


"movaps 


/I / 0/ ri \ 0/0/ J \ it \ 

4(/o0; , /o/oXmml\n" \ 


"pand 


upper_mask , °/o°/oXmmO\n" \ 


"pand 


lower_mask, /o/oXmml\n \ 


"por 


"/ 0/ f\ 0/ 0/ H \ „ tl \ 

/o/oXmmO , /o/oxmml \n \ 


"movaps 


0/0/ J 0/0/ r\\ II \ 

/o/oxmml , /o/oXmmO\n" \ 


"psrld 


$1 , /o/oXmml\n" \ 


"pand 


andl_mask,7o7oXmmO\n" \ 


"pcmpgtd 


zero_mask,7o7oXmmO\n" \ 


"pand 


andA_mask,7o7xmmO\n" \ 


"pxor 


7o7oXmml , 7o7oXmmO\n " \ 


"movaps 


1588(7.0) ,7.7.xmml\n" \ 


" cmpl 


$227,7,2\n" \ 


"jl 


MyLl7.=\n" \ 


"movaps 


-908(7.0) ,7.7.xmml\n" \ 


"My LI 7.= : 


pxor 7.7.xmml,7.7.xmmO\n" \ 


"movups 


7.7.xmmO,(7.0)\n" \ 


"movaps 


7.7.xmmO, 2500 (7.0) \n" \ 


" " : : "r" (mt+i) , "r" (consts) , "r" (i) ) ; 



1403580, c = -1370589, d = 527612. There- 
fore, Xi^4 ~ axi + bx2 (mod mi) and Xi+e — 
aoxo + aixi + 02X2 (mod mi), 2/^+4 = ^o2/o + 
/3iyi + I32y2 (mod to2), y^+g = 7o2/o + 7iyi + 
722/2 (mod m2), where constants ao, ai, a2, /?o, 
132, 70i 71 and 72 can be easily calculated. This 
allows to calculate X4, x^, xq and xj from xq,xi,X2, 
and also 7/4, 3/5, ye and yr from yo, 2/1, 2/2- Calcula- 
tion of fours of numbers was found to be very ef- 
fective. Main parts of the SSE-realization of the 
MRG32k3a algorithm are shown in the left column 
in Table El 

Since there are no SSE commands for taking 
modulo in XMM registers, we use the relation 
z = {f ■ N + g) (mod m) for 2: = / • 2^^ -|- g and 
N = — We calculate zi, Z2, z^ using formulas 

ZO^ 232 ^ ^ ^ ^ . ^ 

Z2 = fi ■ N + gi = f2 ■ 2-'^2 + if Z2 < m then 
Z3 = Z2 else Z3 = Z2 — m. This results in our case 
in Z3 < m. SSE commands admit simultaneous cal- 
culation of x4 (mod ml) and x5 (mod m2). Finally 
the RNG output is calculated as pi - p2 + ml for 
pi < p2 and pi - p2 for pi > p2. 

3.3. SSE4 based LFSR113 

Our realizations of LFSR113 and LFSR113SSE 
(see SecEJl) are presented in Table gl The SSE 
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Table 6: Example of using the realization MRG32K3A-SSE. 

#include<stdio . h> 
#include"mrg32k3asse .h" 

int main(){ 

unsigned i; mrg32k3asse_state state; 
mrg32k3asse_SetState(&state) ; 
mrg32k3asse_genlnit ( 123 , 123 , 123 , 123 , 123 , 123) ; 
mrg32k3asse_genPrintState ; 

f or (i=0 ; i<99999999 ; i++) mrg32k3asse_genRaiid ( ) ; 
printf("10~8 Numbers Generated\n") ; 
printfC'Next Output Value: /.uXn" , 

mrg32k3asse_genRand() ) ; 
mrg32k3asse_genPrintState ; 
return 0; 



realization utilizes pmulld and pblendw processor 
instructions of SSE4 command set. Therefore, the 
realization LFSR113SSE requires CPU supporting 
SSE4, for example, a processor with Intel Core mi- 
croarchitecture. The realization works only for In- 
tel processors and is not valid for AMD processors. 
For shifting four integers of an XMM register to the 
left by different numbers of bits, we use the SSE4 
instruction pmulld. For shifting to the right by 
different numbers of bits, we use SSE4 command 
pblendw. The details of the algorithms are shown 
in Table H 

3.4. SSE2 based GM19, GM31 and GM61 

Table [5] illustrates the key ideas for speeding up 
algorithms GM19, GM31 and GM61 (see Sec. HH) 
using SSE2 command set. Actions of the fast SSE2 
algorithms shown in the left column are equivalent 
to actions of the slow algorithms shown in the right 
column. 

4. How to use the library. Function call in- 
terface. 

11. Realization MRG32K3A-SSE 

Table |6] shows the example of using the 
MRG32K3A-SSE realization in ANSI C language. 
Tabic [7] shows header file names for the generators. 
Using the realization requires including the header 
file mrg32k3asse .h in the code. 

The function mrg32k3asse_SetState ( . . . ) in 
Table [5] saves the pointer to the structure state in 
order that the other functions use it as a pointer to 



the structure that keeps the generator state. Such 
function should be executed prior to using any other 
functions that manipulate with a generator state. It 
can also be used in order to switch between using 
several RNGs of the same type. Table |S] describes 
the function call interface. 

The function mrg32k3asse_genlnit (...) in Ta- 
ble [6] initializes the generator state. Table |9] shows 
details for the initialization function. The values 
of xO, xl, x2, yO, yl, y2 described in Sec. 12.21 and 
Table |3] are required to initialize any version of the 
generator MRG32K3A. 

The function mrg32k3asse_genRand() in TablelH] 
generates a pseudorandom 32-bit integer and makes 
respective transformation to the generator state. 
Detailed interface for this function is presented in 
Table El 

The function mrg32k3asse_genR£Lnd4 (unsigned* 
arr) generates block of four pseudorandom 
32-bit integers and puts them in arr [0] , 
arr[l], arr [2] and arr [3] . The function 
mrg32k3asse_genRand32 (unsigned* arr) gen- 
erates block of 32 pseudorandom 32-bit integers 
and puts them in arr[0], arr[l], . . ., arr [31]. 
Detailed interface for the functions is presented in 
Table [71 Generation of blocks of numbers is further 
discussed in Sec. and Tables [T^ and [Til 

The function mrg32k3asse_genPrintState () 
prints out the generator state. The function call 
interface is shown in Table [H 

4.2. Realization MRG32K3A 

The similar example as in Table [SI for the real- 
ization MRG32K3A, which is exact reproduction of 
the algorithm published in 4] and is based on usual 
command set, can be found in the file mrg32k3a.c 
in flO^. 

The header file for the realization is mrg32k3a.h, 
as it is shown in Table [71 

Table [HI describes the function call interface for 
the function inrg32k3a_SetState ( . . .) that saves 
a pointer to the generator state. 

Table [SI shows details for the initialization func- 
tions mrg32k3a_genlnit ( . . .). The values of xO, 
xl, x2, yO, yl, y2 described in Sec. [O and Table [31 
are required to initialize the generator MRG32K3A. 

The function mrg32k3a_genRcLnd() generates a 
pseudorandom floating point number between and 
1 and makes respective transformation to the gener- 
ator state. The function call interface is presented 
in Table [71 
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Table 3: Realizations MRG32K3aSSE (left column) and MRG32k3a (right column) producing equivalent outputs. 



main parts of SSE2-based MRG32k3aSSE 



MRG32k3a-L with x86-64 command set 



unsigned s[16] attribute ( (aligned(16) ) ) ; 

// { xO,0,xl,0, x2,0,x3,0, yO,0,yl,0, y2,0,y3,0 > 

unsigned consts[72] attribute ( (aligned(16) ) ) = 

{ x4ap,0,x4ap,0, x4bm , , x4bm , , 

x4cp , , x4cp , , x4addl , x4addh , x4addl , x4addhL , 

x6ap,0,x6ap,0, x6bp,0,x6bp,0, 

x6cm , , x6cm , , x6addl , x6addli , x6addl , xSaddh , 

ml,0,ml,0, [...] 

}; 

II Calculating x4 and x5 from xl and x2: 

asmC'movups SC/.O) //.7.xmml\n"\ 
"movaps 16(7.0) ,y."/.ximn2\n"\ 
"pmuludq 16(7.1) ,7.7.ximnl\n"\ 
"pmuludq 32(7.1) ,y.7.xinm2\n"\ 
"paddq 48(7.1) ,7.7.xmm2\n"\ 
"psubq 7.7.xmnil , 7.7.ximn2\n" \ 
" movaps 7.7.xmm2 , 7.7.xmm3\n " \ 
"psrlq $32,7.7.xmm3\n"\ 
"pmuludq 128(7.1) ,7.7.xmm3\n"\ 
"psubq 7.7.xmm3,7.7.xmm2\n"\ 
"movaps 7.7oXmm2 , 7o7.xmm3\n"\ 
"psrlq $32,7.7.xmm3\n"\ 
"pmuludq 128(7.1) ,7.7.xmm3\n"\ 
"psubq 7.7.xmm3,7.7.xmm2\n"\ 
"psubq 128(7.1) ,7.7.xmm2\n"\ 
"movaps 7.7.xmm2 , 7.7.xmm3\n" \ 
"psrad $31,7.7.xmm3\n"\ 
"psrlq $32,7.7.xmm3\n"\ 
"pand 128(7.1) ,7.7.xmm3\n"\ 
"paddq 7.7.xmm3,7.7.xmm2\n"\ 
"movaps 7.7.xmm2 , (7.0) \n" \ 
""::"r"(s),"r"(consts)); 

// Calculating (xO-yO) (mod ml) : 

asm("movaps (7.0) ,7,7.xmml\n"\ 
"psubq 32(7.0) ,7.7.xmml\n"\ 
"movaps 7.7.xmml , 7.7.xmm4\n" \ 
"psrad $31,7.7.xmm4\n"\ 
"psrlq $32,7.7.xmm4\n"\ 
"pand 128(7.1),7.7.xmm4\n"\ 
"paddq 7.7.xmm4,7.7.xmml\n"\ 
"": :"r"(s),"r"(consts)); 



const long long addl=9007203111403311U; 
const long long add2=9007202867859652U; 

// here ml divides addl, m2 divides add2 

const long long ml= 4294967087U; 
const long long m2= 4294944443U; 
const long long a = -810728; 
const long long b = 1403580; 
const long long c = -1370589; 
const long long d = 527612; 
long long xO, xl, x2, yO, yl, y2; 

unsigned mrg32k3a (){ 

long k; long long pi, p2; 

pi = (addl + b*xl + a*xO) % ml; 

xO = xl; xl = x2; x2 = pi; 

p2 = (add2 + d*y2 + c*yO) 7. m2; 

yO = yl; yl = y2; y2 = p2; 

if (pi <= p2) return (pi - p2 + ml) ; 
else return (pi - p2) ; 
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Table 4: Algorithms LFSR113 and LFSR113SSE 



SSE4 based LFSR113SSE 


LFSR113 based on usual instruction set 


unsigned z [4] attribute ( (aligned (16) ) ) 

= {12345,12345,12345,12345}; 

unsigned lfsrll3(){ 
unsigned output; 
asmC'movaps (7.1) ,7.7.xmml\n" \ 

"movaps (7,2) ,7.7.xmm2\n" \ 

"movaps (7.4) ,7.7.xmmO\n" \ 

"pand 7»7»xiniiil ,7«7»xinm2\n" \ 

"pmulld (7.3) ,7.7.xmm2\n" \ 

"pmulld 7»7»xmiiil ,7»7»xmiiiO\n" \ 

"pxor 7»7«xinmO,7»7»xinml\n" \ 

"psrld $12,7.7.xmml\n" \ 

"pblendw $192,%7.xmml,%7.xmm3\n" \ 

"psrld $l,7.7.xmml\n" \ 
pblendw $3 , /o/oxmml , A/oXmm3\n" \ 

"psrld $8,7.7.xmml\n" \ 

"pblendw $48 ,7.7.xmml ,7.7.xmm3\n" \ 

"psrld $6,7.7.xmml\n" \ 

"pblendw $12,%%xmml,7,y,xmm3\n" \ 

"pxor 7»7»xmm2,7»7»xmm3\n" \ 

"movaps 7.7.xmm3, (7.1)\n" \ 

"pshufd $255,7.7.xmm3,7.7.xmmO\n" \ 

"pshufd $170,7.7.xinm3,7.7.xinml\n" \ 

"pshufd $85 , 7o7oXmm3 , 7o7oXmm2\n" \ 

"pxor yo7oxmm0,7oyoxmm3\n" \ 

"pxor 7»7»xminl ,7o7oXinm2\n" \ 

"pxor 7o7oXmm2,7o7«xmm3\n" \ 

"pextrd $0,7.7.xinm3,7.0\n" \ 

" " : "=&r" (output) : "r" (z) , "r" (a) , 

"r" (b) ,"r"(c)) ; 
return output; 

} 


unsigned zl=12345,z2=12345, 
z3=12345,z4=12345; 

unsigned lfsrll3() 
{ 

unsigned long b; 

b = ((zl << 6; zl) >> 13; 

zl = ((zl & 4294967294U) « 18) ~ b; 

b = ((z2 « 2) - z2) » 27; 

z2 = ((z2 & 4294967288U) « 2) " b; 

b = ((z3 « 13) ~ z3) » 21; 

z3 = ((z3 & 4294967280U) « 7) " b; 

b = ((z4 « 3) - z4) » 12; 

z4 = ((z4 & 4294967168U) « 13) " b; 

return (zl " z2 " z3 " z4) ; 

> 
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Table 5: Equivalent realizations for several algorithms for processor with SSE2 support (left column) and in ANSI C language 
(right column). First row presents calculating four matrix multiplications. Second row presents the packing 16 high bits of 16 
integers into one integer. These or similar equivalences are used in constructing the SSE2 algorithms for GM19, GM31 and 
GM61 algorithms 



Four matrix multiplications in parallel, SSE2 version 



Same with usual instruction set 



unsigned long x[4] ,y[4] ; 



[. 



.] 



asm( "movaps 
"movaps 
"paddd 
"paddd 
"movaps 
"pslld 
"paddd 
"movaps 
"psubd 
"movaps 



(7.0) ,7J.xmmO\n" \ 

(7.1) ,y.y.xmml\n" \ 
77xmml,77xmmO\n" \ 
77xmml,77xmmO\n" \ 
77.xmmO,7,7.xmm2\n" \ 
$2,77xmm0\n" \ 
77xmml,77xmmO\n" \ 
77.xmmO, (7.0)\n" \ 
77xmm2,77xmm0\n" \ 
77.xmmO, (7.1)\n" \ 

(x) ,"r"(y)) ; 



unsigned long i,newx[4] ,x[4] ,y[4] ; 
[ ] 

f or(i=0; i<4; i++){ 

newx [i] =4*x [i] +9*y [i] ; 
y[i]=3*x[i]+7*y[i] ; 
X [i] =newx [i] ; 

> 



Packing 16 high bits into one integer, SSE2 version 



Same with usual instruction set 



unsigned long x [16] , output ; 



.] 



asm( 



movaps (71) ,77xmmO\n" \ 
movaps 16(7ol) ,7o7xmml\n" \ 
movaps 32(7,1) ,7o7oXmm2\n" \ 
movaps 48(7.1) ,7.7.xmm3\n" \ 
psrld $31,77xmm0\n" \ 
psrld $31,77xmml\n" \ 
psrld $31,77xmm2\n" \ 
psrld $31,77xmm3\n" \ 
packssdw 7o7oXmml ,7o7oXmmO\n" \ 
packssdw 7o7oXmm3,7o7oXmm2\n" \ 
packsswb 7o7oXmm2,7o7oXmmO\n" \ 
psllw $7,77xmm0\n" \ 
pmovmskb 7o7oXmmO,7oO\n" \ 
" :"=r" (output) :"r"(x)); 



const unsigned long half g=2147483648 ; 
unsigned long x [16] , i , output=0 ,bit=l ; 



[. 



.] 



f or(i=0; i<16; i++){ 

output += ( (x [i] <half g) ?0 : bit ; 
bit*=2; 

} 
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The function mrg32k3a_genPrintState () prints 
out the generator state. The function call interface 
is shown in Table [51 

4.3. Realizations MT 19937 and MT 19937- SSE 

The similar examples as in Table [5] for MT19937 
and MT19937-SSE can be found in files mtl9937 . c 
and mtl9937sse.c in [13] • 

The header files for the realizations are 
mtl9937.h and mtl9937sse . h, as it is shown in 
Table El 

Table [8] describes the function call interfaces 
for the functions mtl9937_SetState ( . . . ) and 
mtl9937sse_SetState ( . . . ) that save a pointer to 
the corresponding generator state. 

Table [9] shows details for the initializa- 
tion functions mtl9937_genlnit ( . . . ) and 
mtl9937sse_genlnit ( . . . ) . Versions of the 
MT19937 generator are initialized with four 
unsigned integer values (25| . 

The functions mtl9937_genRand() and 
mtl9937sse_genRarLd() generate a pseudorandom 
32-bit integer and make respective transformation 
to the corresponding generator state. Detailed 
interface for the functions is presented in Table [71 

The function mtl9937sse_genRand624() gen- 
erates block of 624 pseudorandom 32-bit un- 
signed integers and puts them in state->out [0] , 
state->out [1] , . . . , state->out [623] , where 
state->out is array of unsigned integers, and 
the pointer state to the structure of the type 
mtl9937sse_state should have previously been 
saved with the function mtl9937sse_SetState. 
Generation of blocks of numbers is further discussed 
in Sec. [H and Tables [HI and [H 

The functions mtl9937_genPrintState () and 
mtl9937sse_genPrintState print out the cor- 
responding generator state. Detailed interface for 
the functions is shown in Table [51 

44. Realizations LFSR113 and LFSR113-SSE 

The similar examples as in Table [SI for LFSR113 
and LFSR113-SSE can be found in files If srll3. c 
and lfsrll3sse.c in [lo| . 

The header files for the realizations are 
lfsrll3.h and If srll3sse .h, as it is shown in 
Table [3 

Table [51 describes the function call interfaces 
for the functions If srll3_SetState ( . . . ) and 
If srll3sse_SetState( . . . ) that save a pointer to 
the corresponding generator state. 



Table [HI shows details for the initializa- 
tion functions If srll3_genlnit ( . . . ) and 
If srll3sse_genlnit ( . . . ) . Initializing any 
version of the generator LFSR113 requires initial 
values of zl, z2, z3, z4 described in Sec. 12.31 and 
Table H for the LFSR113 algorithm. 

The functions If srll3_genRand() and 
If srll3sse_genRctnd() generate a pseudorandom 
32-bit integer and make respective transformation 
to the corresponding generator state. Detailed 
interface for the functions is presented in Table [71 

The functions If srll3_genPrintState () and 
If srll3sse_genPrintState print out the cor- 
responding generator state. Detailed interface for 
the functions is shown in Table [51 

4.5. Realizations GM19, GM19-SSE, GM31, 
GM31-SSE, GM61 and GM61-SSE 

The similar examples as in Table [51 for GM19, 
GM19-SSE, GM31, GM31-SSE, GM61 and GM61- 
SSE can be found in files gml9.c, gml9sse.c, 
gmSl . c, gm31sse . c, gm61 . c and gm61sse . c in [ 10| . 

The header files for the realizations are gml9.h, 
gml9sse.h, gin31.h, gmSlsse.h, gm61.h and 
gm61sse.h, as it is shown in Table [71 

Table [51 describes the function call inter- 
faces for the functions gml9_SetState ( . . . ) , 
gml9sse_SetState (...), gm31_SetState (...), 
gm31sse_SetState (...), gm61_SetState (...) 
and gm61sse_SetState( . . .) that save a pointer 
to the corresponding generator state. 

Table [9l shows details for the initial- 
ization functions gnil9_genlnit (...), 
gml9sse_genlnit (...), gm31_genlnit (...), 
gm31sse_genlnit (...), gm61_geiiliiit (...) and 
gm61sse_genlnit ( . . .). For generators GM19 and 
GM19SSE it is required that < seed < 9, for GM31 
and GM31SSE it is required that < seed < 99, for 
GM61 and GM61SSE - < seed < 999. 

The functions gml9_genRcLnd() , 

gml9sse_genRand() , gm31_genRcLnd() , 

gm31sse_genRand() , gm61_genRand() and 
gm61sse_genRand() generate a pseudorandom 
32-bit integer and make respective transformation 
to the corresponding generator state. Detailed 
interface for the functions is presented in Table [71 

The functions gml9_genPrintState () , 
gml9sse_genPrintState , 
gm31_genPrintState() , 
gm31sse_genPrintState , 

gm61_genPrintState() and 
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gm61sse_genPrintState print out the cor- 
responding generator state. Detailed interface for 
the functions is shown in Table [51 

5. Speed and properties 

CPU times needed for generating 10^ random 
numbers are shown in Tables [TOl E] and [12] for 
Intel Core 2 Duo E7400, Intel Core 17-940 and 
AMD Turion X2 RM-70 processors respectively. 
The results are presented for different compilers 
and optimization options for all RNGs of our li- 
brary. We use GNU C compiler gee version 4.3.3, 
and Intel C compiler ice version 11.0. In all 
cases we observe decrease of computation time for 
the modified algorithms as compared with origi- 
nal algorithms. In particular, using GNU C com- 
piler gcc with optimization level -00 we find that 
MRG32K3ASSE is about 4.5 times faster than 
MRG32K3A, whereas LFSR113SSE is about 40% 
faster than LFSR113 and MT19937SSE is about 3 
times faster than MT19937. Therefore, the mod- 
ified generators turn out to be very fast. Algo- 
rithms GM19SSE, GM31SSE and GM61SSE are 
10-22 times more efficient than corresponding al- 
gorithms based on usual command set, and become 
competitive with other modern generators. 

Table [T3l contains CPU-times for generating 10^ 
random numbers via blocks of numbers, both for 
our SSE2-based algorithm MT19937SSE and the 
MT 19937 generator from Intel MKL library The 
output values of the generator realizations are iden- 
tical. 

Table [TH contains CPU-times for generating 10^ 
random numbers via blocks of numbers, both for 
our SSE2-based algorithm MRG32K3ASSE and the 
MRG32K3A generator from Intel MKL library. 
The output values of the generator realizations are 
identical. 

6. Statistical properties and period lengths 

Applying hundreds of empirical tests (so-called 
batteries of tests) allows to estimate statistical ro- 
bustness of the generators. As we mentioned above, 
here we considered only generators with best statis- 
tical properties (see also Sec. 4.5.4 and Sec. 4.6.1 
in 0, and Sec. 3 in §). 

Table [TSl shows the results of applying the Small- 
Crush, PseudoDiehard, Crush and Bigcrush batter- 
ies of tests taken from [2^ , to the generators of our 



library. For each battery of tests. Table [TS] shows 
three numbers: the number of statistical tests 
with p-values outside the interval [10~^, 1 — 10~^], 
number of tests with p-values outside the interval 
[10~^, 1 — 10~^], and number of tests with p-values 
outside the interval [lO^^", 1 - lO'^^]. The last col- 
umn in Table [TS] shows the RNG period lengths. 

Libraries SmallCrush, PseudoDiehard, Crush 
and Bigcrush contain 15, 126, 144 and 160 tests re- 
spectively. We consider the statistical test "failed" 
if the p- value lies outside the region [10"'^, 1 — 10"'^]. 
Varying free initial conditions of the generators 
GM19, GM31 and GM61 and applying statistical 
tests, one can observe occasionally a single failed 
test with p-value in the region [10~^, 10"'^] U [1 — 
lO^'^, 1 — 10^^]. Since the number of applied tests 
is of the order of 10"^, such a single failed test can 
be explained simply by random statistical fiuctua- 
tions, which do not characterize the quality of ran- 
dom numbers. As seen, all generators possess excel- 
lent statistical properties. Table [T51 shows that the 
generators, except LFSR113 and MT19937 pass all 
statistical tests in the TestUOl library. Such gener- 
ators can be recommended for practical use. 

As was mentioned in [S^l, the generators 
MT19937 and LFSR113 fail tests that try to find a 
linear structure in the bits of the output, because 
their bits have a linear structure by construction. 
Hence this particular point does not mean serious 
deficiencies of the generator MT19937. MT19937 
passes all other tests. Namely, both of MT19937 
and LFSR113 fail the test scomp_LinearComp, and 
also LFSR113 fails the test smarsa_MatrixRajik. 

This work was supported by Russian Foundation 
for Basic Research. 
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Tabic 7; Function call interfaces for generating 32-bit pseudorandom numbers. 



Generator name 


File to include 


State type 


Generate 32-bit random number(s) 


GM19 


giiil9.h 


ginl9_state 


unsigned gml9_genRand() ; 


GM19 (SSE) 


gml9sse .h 


ginl9sse_state 


tmsigned gml9sse_genRand() ; 


GM31 


gm31.h 


gm31_state 


unsigned gin31_genRaiid() ; 


GM31 (SSE) 


gm31sse.h 


gm31sse_state 


imsigned gm31sse_genRaiid() ; 


GM61 


gm61 .h 


gm61_state 


unsigned gm61_genRaiid() ; 


GM61 (SSE) 


gm61sse . h 


gm61sse_state 


unsigned gm61sse_genRand() ; 


LSFR113 


lfsrll3.h 


lfsrll3_state 


unsigned If srll3_genRand() ; 


LFSR113 (SSE) 


If srll3sse.h 


If srll3sse_state 


unsigned If srll3sse_genRand() ; 


MRG32K3A 


mrg32k3a. h 


mrg32k3a_state 


double mrg32k3a_genRand() ; 


MRG32K3A (SSE) 


mrg32k3asse.h 


mrg32k3asse_state 


unsigned mrg32k3asse_genRaiid() ; 


MRG32K3A (SSE, 4) 


mrg32k3asse .h 


mrg32k3asse_state 


void mrg32k3asse_genRaiid4(unsigned* arr) ; 


MRG32K3A (SSE, 32) 


mrg32k3asse32 . h 


mrg32k3asse32_state 


void mrg32k3asse32_genRand32 (unsigned* arr); 


MT19937 


mt 19937. h 


mtl9937_state 


unsigned long mtl9937_genRand() ; 


MT19937 (SSE) 


iiitl9937sse.h 


mtl9937sse_state 


unsigned intl9937sse_genRand() ; 


MT19937 (SSE, 624) 


iiitl9937sse.h 


mtl9937sse_state 


void mtl9937sse_genRand624() ; 



Table 8: Function call interfaces for keeping the pointer to a state and for printing out a generator state. 



Set state 


Print out state 


void gml9_SetState (gml9_state* state); 
void gml9sse_SetState(gml9sse_state* state); 

void gm31_SetState(gm31_state* state); 
void gm31sse_SetState(gm31sse_state* state); 

void gm61_SetState (gm61_state* state); 
void gm61sse_SetState(gm61sse_state* state); 
void If srll3_SetState(lf srll3_state* state); 
void If srll3sse_SetState(lf srll3sse_state* state); 
void inrg32k3a_SetState(inrg32k3a_state* state); 
void mrg32k3asse_SetState(mrg32k3asse_state* state); 
void mrg32k3asse32_SetState(inrg32k3asse32_state* state); 
void mtl9937_SetState(intl9937_state* state); 
void intl9937sse_SetState(intl9937sse_state* state); 


void gml9_genPrintState() ; 
void gml9sse_genPrintState ; 

void gm31_genPrintState() ; 
void gm31sse_genPrintState ; 

void gin61_genPrintState() ; 
void gin61sse_genPrintState() ; 
void If srll3_genPrintState() ; 
void If srll3sse_genPrintState() ; 

void inrg32k3a_genPrintState() ; 
void mrg32k3asse_genPrintState() ; 
void inrg32k3asse32_genPrintState() ; 
void intl9937_genPrintState() ; 
void intl9937sse_genPrintState() ; 
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Table 9: Function call interfaces for initializing the generators. 



State type 


Initialize state 


gml9_state 

gml9sse_state 

gm31_state 

gm31sse_state 

gm61_state 

gm61sse_state 

If srll3_state 

If srll3sse_state 

mrg32k3a_state 

mrg32k3asse_state 

inrg32k3asse32_state 

mtl9937_state 

mtl9937sse_state 


void gml9_genlnit (unsigned seed); 
void gml9sse_genlnit (unsigned seed); 
void gm31_genlnit (unsigned seed); 
void gm31sse_genlnit (unsigned seed); 
void gm61_genlnit (unsigned seed); 
void gm61sse_genlnit (unsigned seed); 

void If srll3_genlnit (unsigned zl, unsigned z2, unsigned z3, unsigned z4) ; 
void If srll3sse_genlnit (unsigned zl, unsigned z2, unsigned z3, unsigned z4) ; 
void mrg32k3a_genlnit (double xO, double xl, double x2, 

double yO, double yl, double y2) ; 
void mrg32k3asse_genlnit (unsigned xO, unsigned xl, unsigned x2, 

unsigned yO, unsigned yl, unsigned y2) ; 
void mrg32k3asse32_genlnit (unsigned xO, unsigned xl, unsigned x2, 

unsigned yO, unsigned yl, unsigned y2) ; 
void mtl9937_genlnit (unsigned long initO, unsigned long initl, 

unsigned long init2, unsigned long init3) ; 
void mtl9937sse_genlnit (unsigned long initO, unsigned long initl, 

unsigned long init2, unsigned long init3) ; 



Table 10: CPU time (sec) for generating lO'' random numbers. Processor; Intel Core 2 Duo E7400. Compilers: gcc 4.3.3, ice 
11.0. 





gcc -OO 


gcc -01 


gcc -02 


gcc -03 


ice -OO 


ice -01 


ice -02 


ice -03 


MRG32k3a 


47.4 


35.7 


36.0 


27.2 


55.9 


32.6 


26.0 


26.0 


MRG32k3aSSE 


10.5 


8.8 


7.2 


7.2 


10.4 


8.7 


7.4 


7.7 


MRG32k3a-4SSE 


7.3 


6.9 


6.9 


6.9 


7.3 


6.9 


6.9 


6.9 


LFSR113 


10.9 


4.6 


4.8 


3.5 


10.8 


5.1 


4.6 


4.6 


LFSR113SSE 


8.0 


7.4 


7.3 


7.3 


8.3 


7.7 


7.5 


7.5 


MT19937 


16.6 


6.1 


6.1 


2.7 


15.8 


6.8 


2.9 


2.8 


MT19937SSE 


5.8 


5.2 


6.1 


2.3 


5.9 


5.3 


2.3 


2.3 


GM19 


616.9 


242.2 


200.1 


144.4 


628.9 


226.9 


125.1 


124.9 


GM19SSE 


32.8 


30.6 


27.8 


27.7 


33.5 


29.8 


28.5 


28.2 


GM31 


1067.9 


255.6 


204.9 


151.4 


1110.5 


318.0 


175.8 


183.0 


GM31SSE 


49.9 


45.5 


41.8 


41.8 


49.5 


44.4 


43.2 


45.0 


GM61 


1794.0 


987.5 


923.6 


844.5 


1908.9 


1002.1 


853.4 


864.4 


GM61SSE 


184.5 


183.0 


181.6 


181.6 


186.0 


192.3 


181.8 


188.7 
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Table 11: CPU time (sec) for generating 10'^ random numbers. Processor: Intel Core 17-940. Compilers: gcc 4.3.3, ice 11.0. 





gcc -OO 


gcc -01 


gcc -02 


gcc -03 


ice -OO 


ice -01 


ice -02 


ice -03 


MRG32k3a 


47.9 


36.3 


35.3 


25.0 


56.1 


33.1 


22.8 


28.1 


MRG32k3aSSE 


9.1 


7.4 


5.8 


5.8 


8.8 


7.4 


6.0 


5.9 


MRG32k3a-4SSE 


6.2 


5.8 


5.8 


5.7 


6.4 


5.9 


5.9 


5.7 


LFSR113 


10.4 


4.8 


6.8 


3.1 


10.2 


5.0 


4.6 


4.5 


LFSR113SSE 


8.0 


6.8 


6.8 


6.9 


7.3 


6.9 


6.6 


6.5 


MT19937 


13.7 


5.7 


6.9 


2.6 


17.5 


6.5 


2.9 


2.9 


MT19937SSE 


5.2 


4.8 


5.5 


2.0 


4.9 


4.7 


2.4 


2.0 


GM19 


578.5 


210.1 


163.1 


117.1 


604.3 


259.4 


110.7 


123.3 


GM19SSE 


32.3 


28.7 


26.7 


29.9 


33.6 


33.0 


27.8 


34.1 


GM31 


999.0 


244.6 


181.6 


134.3 


978.9 


298.8 


143.4 


151.0 


GM31SSE 


39.1 


36.4 


38.5 


58.7 


47.1 


36.4 


35.4 


35.9 


GM61 


1599.7 


893.8 


836.9 


766.6 


1606.5 


870.8 


770.1 


795.1 


GM61SSE 


120.6 


123.0 


116.8 


117.2 


124.6 


120.2 


130.7 


117.7 



Table 12: CPU time (sec) for generating 10^ random numbers. Processor: AMD Turion X2 RM-70. Compilers: gcc 4.3.3, ice 
11.0. 





gcc -OO 


gcc -01 


gcc -02 


gcc -03 


ice -OO 


ice -01 


ice -02 


ice -03 


MRG32k3a 


89.0 


60.9 


60.9 


47.0 


89.1 


69.2 


41.5 


41.6 


MRG32k3aSSE 


25.9 


22.3 


18.4 


18.3 


25.6 


22.3 


19.0 


19.0 


MRG32k3a-4SSE 


18.3 


18.2 


18.2 


18.2 


18.1 


18.2 


18.1 


18.3 


LFSR113 


14.6 


8.7 


9.6 


5.3 


14.9 


9.1 


6.9 


6.8 


MT19937 


31.0 


17.8 


10.8 


7.1 


31.0 


18.7 


5.2 


4.9 


MT19937SSE 


11.3 


10.3 


11.1 


6.6 


10.8 


9.9 


6.0 


6.0 


GM19 


1651.5 


385.1 


259.9 


222.1 


1759.9 


375.4 


198.6 


198.4 


GM19SSE 


96.0 


105.3 


87.5 


87.5 


103.5 


102.8 


88.6 


88.9 


GM31 


1512.5 


436.7 


313.2 


237.3 


1573.8 


506.0 


228.7 


243.9 


GM31SSE 


136.9 


130.0 


130.0 


131.1 


137.1 


130.9 


128.3 


128.5 


GM61 


5138.9 


3826.0 


3991.9 


3567.9 


5221.6 


3849.2 


3604.7 


3586.9 


GM61SSE 


418.3 


413.6 


411.1 


409.7 


427.5 


416.6 


407.7 


410.2 



Table 13: Comparing CPU-times for different versions of MT19937 for generating 10® random numbers via blocks of N numbers. 
Speed is also compajred with the respective realization of MT19937 generator in Intel MKL library. Processor: Intel Core 17-940. 



N 


Intel MKL library 


Our version 




ice -OO 


ice -01 


ice -02 


ice -03 


ice -OO 


ice -01 


ice -02 


ice -03 


1 


24.7 


22.3 


24.0 


22.1 


4.9 


4.7 


2.4 


2.0 


624 


1.8 


1.8 


1.8 


1.8 


1.8 


1.6 


1.4 


1.4 
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Table 14: Comparing CPU-times for different versions of MRG32K3A for generating 10® random numbers via blocks of N 
numbers. Speed is also compared with the respective realization of MRG32K3A generator in Intel MKL library. Processors; 
Intel Xeon 5160 (upper table), Intel Core 17-940 (lower table). 



N 


Intel MKL library 


Our version 




iec -OO 


iec -01 


iec -02 


ice -03 


ice -OO 


ice -01 


ice -02 


ice -03 


1 


37.5 


33.9 


33.0 


33.0 


10.0 


8.2 


7.2 


7.2 


4 


18.4 


18.4 


17.8 


17.8 


7.3 


6.6 


6.6 


6.6 


32 


10.6 


10.6 


10.6 


10.6 


6.6 


6.6 


6.5 


6.5 


1000 


6.4 


6.6 


6.6 


6.6 


6.6 


6.6 


6.6 


6.6 


N 


Intel MKL library 


Our version 




iec -OO 


iec -01 


iec -02 


ice -03 


ice -OO 


ice -01 


ice -02 


ice -03 


1 


31.2 


27.7 


28.1 


29.6 


8.8 


7.4 


6.0 


5.9 


4 


16.3 


16.0 


15.6 


15.7 


6.4 


5.9 


5.9 


5.7 


32 


8.6 


8.6 


8.5 


8.8 


5.4 


5.3 


5.2 


5.2 


1000 


5.3 


5.3 


5.1 


5.1 


5.5 


5.3 


5.3 


5.3 



Table 15: Properties of the generators: numbers of failed tests for the batteries of tests SmallCrush, Crush, Bigcrush |26| . and 
Diehard [26ll . and period lengths. Testing was performed with package TestUOl version TestUOl-1.2.3. For each battery of tests, 
we present three numbers: the number of statistical tests with p- values outside the interval [10"'^, 1 — 10~^], number of tests 
with p-values outside the interval [10~^, 1 — 10~^], and number of tests with p-values outside the interval [10"^", 1 — 10"^"]. 



Generator 


SmallCrush 


Diehard 


Crush 


Bigcrush 


Period 


MRG32k3a 


0,0,0 


0,0,0 


0,0,0 


0,0,0 


3.1 • 10" 


LFSR113 


0,0,0 


1,0,0 


6, 6, 6 


6, 6, 6 


1.0 • 10^4 


MT19937 


0,0,0 


0,0,0 


2,2,2 


2,2,2 


4.3 ■ 106"oi 


GM19SSE 


0,0,0 


0,0,0 


0,0,0 


0,0,0 


2.7-10" 


GM31SSE 


0,0,0 


0,0,0 


0,0,0 


0,0,0 


4.6-101* 


GM61SSE 


0,0,0 


0,0,0 


0,0,0 


0,0,0 


5.3 - 10^6 
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